Legend: df = DataFrame, pd = pandas
df = pd.read_csv("file.csv") <- Load a CSV file into a DataFrame
df.describe() <- Very Useful
df.columns <- Read Headers (names of each column)
^- Output: Index(['text'], dtype='object')
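A minimal sketch of the loading and inspection calls above. The CSV content and the 'score' column are made up for illustration, and io.StringIO stands in for a real file so the example runs on its own:

```python
import io
import pandas as pd

# Stand-in for a file on disk (hypothetical content).
csv_text = "text,score\nhello,1\nworld,2\nagain,3\n"
df = pd.read_csv(io.StringIO(csv_text))

summary = df.describe()   # count, mean, std, min, quartiles, max for numeric columns
headers = df.columns      # the column names as an Index object

print(summary)
print(headers)
```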
Quick Previews
df.head(3) <- First 3 rows
df.tail(2) <- Last 2 rows
Headers
df['text'] <- Read the values in the named column
df[['text1', 'text2']] <- Read multiple columns (pass a list of names)
Rows
df.iloc[1] <- Read the row at position 1 (the second row; positions start at 0)
df.iloc[1, 2] <- Read the value at row 1 ∩ column 2
Iterating over each row
for index, row in df.iterrows():
    print(index, row['Name'])
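A short sketch of the column, row, and iteration patterns above on a tiny made-up table. The names and values are illustrative assumptions, not real data:

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({
    'Name': ['Bulbasaur', 'Charmander', 'Squirtle'],
    'Type 1': ['Grass', 'Fire', 'Water'],
    'hp': [45, 39, 44],
})

first_row = df.iloc[0]          # position 0 is the first row
second_row = df.iloc[1]         # position 1 is the second row
cell = df.iloc[1, 2]            # row at position 1, column at position 2 ('hp')
two_cols = df[['Name', 'hp']]   # a list of names selects multiple columns

# iterrows() yields (index label, row as a Series) pairs.
names = []
for index, row in df.iterrows():
    names.append((index, row['Name']))
```

iterrows() is convenient for small tables, but it is slow on large ones, so column-wise operations are usually the better choice there.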
df.loc[df['Type 1'] == "Grass"] <- Keep only the rows where 'Type 1' is "Grass"
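A sketch of boolean filtering with loc, together with sort_values, on a small made-up table (all names and numbers here are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    'Name': ['Bulbasaur', 'Charmander', 'Oddish', 'Squirtle'],
    'Type 1': ['Grass', 'Fire', 'Grass', 'Water'],
    'hp': [45, 39, 50, 44],
})

# Boolean mask: keep only the rows whose 'Type 1' equals "Grass".
grass = df.loc[df['Type 1'] == 'Grass']

# Sort by one column, highest first.
by_hp = df.sort_values('hp', ascending=False)

# Sort by several columns; ascending takes one flag per column.
by_type_then_hp = df.sort_values(['Type 1', 'hp'], ascending=[True, False])
```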
df.sort_values('Type 1', ascending=False) <- Sort by a column, descending
df.sort_values(['Type 1', 'hp']) <- Multiple columns are allowed
Data Frames
(looks like an Excel table)
We can think of a DataFrame as a combination of multiple Series that share the same index
index=[] sets the row labels
columns=[] sets the column labels
import pandas as pd
certificates_earned = pd.DataFrame({
    'Certificates': [8, 2, 5, 6],
    'Time (in months)': [16, 5, 9, 12]
})
names = ['Tom', 'Kris', 'Ahmad', 'Beau']
certificates_earned.index = names
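Once the index holds names, rows can be read by label as well as by position. The sketch below rebuilds the same table and shows both, plus the "combination of Series" idea; the variable names other than certificates_earned are illustrative:

```python
import pandas as pd

certificates_earned = pd.DataFrame({
    'Certificates': [8, 2, 5, 6],
    'Time (in months)': [16, 5, 9, 12]
})
certificates_earned.index = ['Tom', 'Kris', 'Ahmad', 'Beau']

# Each column is a Series that shares the DataFrame's index.
certs = certificates_earned['Certificates']

# loc selects by row label, iloc by position.
tom_by_label = certificates_earned.loc['Tom']
tom_by_position = certificates_earned.iloc[0]
```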
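isna(), notna(), and dropna()
Functions: pd.isna() pd.notna()
Methods: s.isna() s.notna() s.dropna()
isna() and notna() flag missing values; dropna() removes them.
dropna() can remove rows or columns with missing values: use axis to choose rows or columns, and thresh to set the minimum number of non-missing values a row or column needs in order to be kept.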
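A small sketch of these calls, assuming a made-up Series and DataFrame with a few gaps:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
missing_mask = s.isna()      # True where a value is missing
present_mask = s.notna()     # the inverse of isna()
s_clean = s.dropna()         # drops the missing entries

# Hypothetical table with gaps.
df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [np.nan, np.nan, 6.0],
    'c': [7.0, 8.0, 9.0],
})
rows_complete = df.dropna()          # axis=0 (default): drop rows with any NaN
cols_complete = df.dropna(axis=1)    # axis=1: drop columns with any NaN
rows_thresh = df.dropna(thresh=2)    # keep rows with at least 2 non-missing values
```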
pd.isnull(np.nan) <- Returns True
pandas' isnull function checks whether a value is null (isnull is an alias of isna)
np.nan - (Not a Number)
Why would there be an "np.nan"?
np.nan is recognized by various libraries within the scientific Python ecosystem, including pandas, SciPy, and scikit-learn. This makes it easier to work with missing data across different tools.
DataFrames can be inspected with the info() method and the shape attribute to understand their structure and missing values.
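A sketch of how np.nan behaves and how a table can be inspected, assuming a made-up two-column DataFrame:

```python
import numpy as np
import pandas as pd

# NaN never compares equal to anything, including itself,
# so test for it with pd.isnull / pd.isna rather than ==.
nan_equals_itself = (np.nan == np.nan)
is_null = pd.isnull(np.nan)

# Hypothetical data with one gap in each column.
df = pd.DataFrame({'age': [25.0, np.nan, 31.0], 'city': ['Oslo', 'Lima', None]})

n_rows, n_cols = df.shape               # shape is an attribute, not a method
missing_per_column = df.isna().sum()    # count of missing values in each column
df.info()                               # prints dtypes and non-null counts
```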
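Syntax: DataFrame.fillna(value=None, method=None) where method is 'ffill' or 'bfill'
The fillna method replaces missing values with the values you specify. By default, it returns a new DataFrame with the filled values and leaves the original unchanged.
Newer pandas versions deprecate the method argument in favor of DataFrame.ffill() and DataFrame.bfill().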
Method Values:
ffill - carries the last valid value forward into the missing values that follow it in the same COLUMN
bfill - carries the next valid value backward into the missing values that precede it in the same COLUMN
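A sketch of the three fill styles on a made-up column. It uses df.ffill() and df.bfill() rather than fillna(method=...), since recent pandas releases deprecate the method argument:

```python
import numpy as np
import pandas as pd

# Hypothetical readings with gaps.
df = pd.DataFrame({'temp': [20.0, np.nan, np.nan, 23.0, np.nan]})

filled_const = df.fillna(0)   # replace every missing value with 0
filled_fwd = df.ffill()       # carry the last valid value forward
filled_bwd = df.bfill()       # carry the next valid value backward

# The original is untouched because each call returns a new DataFrame.
still_missing = int(df['temp'].isna().sum())
```

Note that bfill leaves the final gap unfilled, because no later value exists to carry backward.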
Categorical column cleaning involves using unique() or value_counts() to identify invalid values, then replacing or fixing them.
For more complex fixes, coding skills might be required, such as when handling ages with typographical errors.
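A sketch of both kinds of fix on made-up data. The mapping of 'D' to 'F' and the "age above 100 has an extra zero" rule are hypothetical choices for illustration; real data needs its own rules:

```python
import numpy as np
import pandas as pd

# Hypothetical data with invalid categories and a mistyped age.
df = pd.DataFrame({
    'Sex': ['M', 'F', 'F', 'D', 'M', '?'],
    'Age': [29, 30, 24, 290, 25, 22],
})

print(df['Sex'].unique())         # distinct values, including the invalid ones
print(df['Sex'].value_counts())   # how often each value appears

# Assumed mapping: 'D' is a typo for 'F', and '?' means unknown.
df['Sex'] = df['Sex'].replace({'D': 'F', '?': np.nan})

# Assumed rule: an age above 100 has one extra trailing digit.
too_old = df['Age'] > 100
df.loc[too_old, 'Age'] = df.loc[too_old, 'Age'] // 10
```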
Duplicates are a common concern in data analysis.
Handling them requires defining what constitutes a duplicate value.
The DataFrame.duplicated() method in pandas helps identify duplicate values based on specified rules.
The subset=[] parameter narrows the check to specific columns
# Check for duplicated rows based on specific columns
duplicates_subset = df.duplicated(subset=['Name', 'Age'])
Returns BOOLEAN Values:
True: the row is the same as a previous row (or rows)
False: the row is not a duplicate of any previous row
DataFrame.drop_duplicates() removes duplicate rows from a DataFrame based on certain criteria.
keep parameter values:
-> keep='first' -> treat the first occurrence as the original (default)
-> keep='last' -> treat the last occurrence as the original
-> keep=False -> flag or drop ALL DUPLICATES, with no occurrence kept
-> can be passed as a parameter to drop_duplicates() and duplicated()
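A sketch of how the keep values change the result, assuming a made-up table in which rows 0 and 2 match on both 'Name' and 'Age':

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 are duplicates of each other.
df = pd.DataFrame({
    'Name': ['Ann', 'Bob', 'Ann', 'Ann'],
    'Age': [30, 25, 30, 41],
})
cols = ['Name', 'Age']

dup_first = df.duplicated(subset=cols)               # keep='first' (default)
dup_last = df.duplicated(subset=cols, keep='last')   # last occurrence is the original
dup_all = df.duplicated(subset=cols, keep=False)     # every matching row is flagged

deduped = df.drop_duplicates(subset=cols)                # keeps the first occurrence
no_repeats = df.drop_duplicates(subset=cols, keep=False) # drops every duplicated row
```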
Created: 2024-03-03